A Short Introduction to the Pandas Package

Pandas is the Python Data Analysis Library, built on top of numpy and part of the same scientific-Python ecosystem as scipy and IPython. It provides two data structures well suited to term-document matrices and other labelled tabular data: the Series and the DataFrame. Let's jump right in.


In [2]:
import pandas as pd #the recommended import convention for pandas
import numpy as np
sentence = 'the dog bit the man' #our first sentence from the presentation
token_list = sentence.split()
type_list = list(set(token_list)) #we only want each word type listed once
print(type_list)
print(token_list)


['man', 'the', 'bit', 'dog']
['the', 'dog', 'bit', 'the', 'man']

Now, let's initialize a Series with these index labels.


In [3]:
ser1 = pd.Series(index = type_list, dtype = 'float64') #an empty Series indexed by our word types
print(ser1)


man   NaN
the   NaN
bit   NaN
dog   NaN
dtype: float64
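
As an aside, once the Series exists you can also fill in values one at a time by indexing with its labels. A minimal sketch (the assignment below is purely illustrative and not part of the walkthrough):


In [ ]:
ser1['the'] = 2.0   # assign a value by its index label
print(ser1['the'])  # and read it back the same way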

If we want to initialize it with values, we can do this with the data argument or with a dictionary. First, let's get the counts for each word in the sentence.


In [4]:
count_list = []
count_dict = {}
for word in type_list:
    count_list.append(token_list.count(word))
    count_dict[word] = token_list.count(word)
print(count_list)
print(type_list)
print(count_dict)


[1, 2, 1, 1]
['man', 'the', 'bit', 'dog']
{'man': 1, 'the': 2, 'bit': 1, 'dog': 1}

Now, we can create our Series objects.


In [5]:
ser2 = pd.Series(data = count_list, index = type_list)
ser3 = pd.Series(count_dict)
print(ser2)
print(ser3)


man    1
the    2
bit    1
dog    1
dtype: int64
bit    1
dog    1
man    1
the    2
dtype: int64

A Series is essentially a labelled vector, here a frequency term-document vector. In order to construct a term-document matrix, we can create another Series for our second sentence.
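
Before moving on, here is a quick illustration of what the labels buy us: lookups by word, and arithmetic that aligns on labels rather than positions. (The addition below is purely illustrative.)


In [ ]:
print(ser2['the'])  # look up the count for a single word by its label
print(ser2 + ser3)  # addition aligns on the word labels, not on positions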

Quiz:

Below, write a short function that takes as its input a string of text and outputs a dictionary of word counts and a term-document Series.


In [8]:
sent2 = 'the bat hit the ball'
from collections import Counter

def td_Series(text):
    # insert your code here to create a count dictionary and a term-document vector for sent2
    counter = Counter(text.split())
    return counter, pd.Series(counter)

count_dict2, ser4 = td_Series(sent2)
print(ser4)
print(count_dict2)


ball    1
bat     1
hit     1
the     2
dtype: int64
Counter({'the': 2, 'bat': 1, 'hit': 1, 'ball': 1})

At this point, we have two separate Series representing two different term-document vectors. We can bring them together to create a DataFrame, the primary object type in the Pandas package.


In [9]:
df1 = pd.DataFrame(data = [ser3, ser4], index = ['sent1', 'sent2'])
print(df1) 
# if you just evaluate the DataFrame instead of printing it, the notebook renders a nicely formatted HTML view of the table:
df1


       ball  bat  bit  dog  hit  man  the
sent1   NaN  NaN    1    1  NaN    1    2
sent2     1    1  NaN  NaN    1  NaN    2

[2 rows x 7 columns]
Out[9]:
       ball  bat  bit  dog  hit  man  the
sent1   NaN  NaN    1    1  NaN    1    2
sent2     1    1  NaN  NaN    1  NaN    2

2 rows × 7 columns

Notice that we now have an $m \times n$ term-document matrix. We could also create the DataFrame by passing our count dictionaries directly. In this DataFrame, let's also replace all NaN values with 0.


In [14]:
df2 = pd.DataFrame(data = [count_dict, count_dict2], index = ['sent1', 'sent2'])
print(df2)
df2 = df2.fillna(value = 0)
df2


       ball  bat  bit  dog  hit  man  the
sent1   NaN  NaN    1    1  NaN    1    2
sent2     1    1  NaN  NaN    1  NaN    2

[2 rows x 7 columns]
Out[14]:
       ball  bat  bit  dog  hit  man  the
sent1     0    0    1    1    0    1    2
sent2     1    1    0    0    1    0    2

2 rows × 7 columns

Now we can look up values simply by naming (row, column) label pairs. Name the row first, then the column.


In [16]:
print(df1.loc['sent1', 'ball'])
print(df1.loc['sent2', 'ball'])
print(df2.loc['sent1', 'ball'])
print(df2.loc['sent2', 'ball'])
# or do it like this:
print(df1.ball.sent1)
df1


nan
1.0
0.0
1.0
nan
Out[16]:
       ball  bat  bit  dog  hit  man  the
sent1   NaN  NaN    1    1  NaN    1    2
sent2     1    1  NaN  NaN    1  NaN    2

2 rows × 7 columns

We can also access values by their integer row and column positions. Again, first the row, then the column.


In [17]:
df1.iloc[0, 0]


Out[17]:
nan

In [18]:
df1.iloc[1, 0]


Out[18]:
1.0

In [19]:
df2.iloc[0, 0]


Out[19]:
0.0

In [20]:
df2.iloc[1, 0]


Out[20]:
1.0

In [21]:
df2.iloc[0]


Out[21]:
ball    0
bat     0
bit     1
dog     1
hit     0
man     1
the     2
Name: sent1, dtype: float64

In [22]:
df2.index


Out[22]:
Index(['sent1', 'sent2'], dtype='object')

In [23]:
df2.values # which will return a numpy 2d array


Out[23]:
array([[ 0.,  0.,  1.,  1.,  0.,  1.,  2.],
       [ 1.,  1.,  0.,  0.,  1.,  0.,  2.]])

Below are a few other things you can do with a DataFrame.


In [24]:
df2.min(axis = 0)


Out[24]:
ball    0
bat     0
bit     0
dog     0
hit     0
man     0
the     2
dtype: float64

In [25]:
df2.min(axis = 1)


Out[25]:
sent1    0
sent2    0
dtype: float64

In [26]:
np.min(df2, axis = 1) # numpy function works but is slightly slower


Out[26]:
sent1    0
sent2    0
dtype: float64

In [27]:
df2.max(axis = 1)


Out[27]:
sent1    2
sent2    2
dtype: float64

In [28]:
df2.idxmin(axis = 1) # index of the min


Out[28]:
sent1    ball
sent2     bit
dtype: object

In [29]:
df2.idxmax(axis = 1) # index of the max


Out[29]:
sent1    the
sent2    the
dtype: object

In [30]:
df2.values.max() # max of all of the values


Out[30]:
2.0

And simple statistics.


In [31]:
df2.describe()


Out[31]:
           ball       bat       bit       dog       hit       man  the
count  2.000000  2.000000  2.000000  2.000000  2.000000  2.000000    2
mean   0.500000  0.500000  0.500000  0.500000  0.500000  0.500000    2
std    0.707107  0.707107  0.707107  0.707107  0.707107  0.707107    0
min    0.000000  0.000000  0.000000  0.000000  0.000000  0.000000    2
25%    0.250000  0.250000  0.250000  0.250000  0.250000  0.250000    2
50%    0.500000  0.500000  0.500000  0.500000  0.500000  0.500000    2
75%    0.750000  0.750000  0.750000  0.750000  0.750000  0.750000    2
max    1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    2

8 rows × 7 columns


In [32]:
df2.mean(axis = 1)


Out[32]:
sent1    0.714286
sent2    0.714286
dtype: float64

In [33]:
df2.loc['sent1'].mean()


Out[33]:
0.7142857142857143

In [34]:
df2.std(axis = 1) # standard deviation


Out[34]:
sent1    0.755929
sent2    0.755929
dtype: float64

Now, what can we do with this? We can, for example, use the correlation method built into Pandas.


In [35]:
df2.iloc[0].corr(df2.iloc[1])


Out[35]:
0.12499999999999988
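
With more than two documents, computing each pairwise correlation by hand gets tedious. DataFrame.corr works column-wise, so one way (sketched here, not run above) is to transpose first and get the whole document-by-document correlation matrix in one call:


In [ ]:
df2.T.corr()  # correlations between the columns of df2.T, i.e. between sent1 and sent2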

Or, if we have the scikit-learn package, there is a lot more we can do.

Note: to install scikit-learn on Linux with Python 3.4, use the following command: [sudo] pip3 install git+https://github.com/scikit-learn/scikit-learn.git

The tf-idf metric stands for 'term frequency-inverse document frequency'. It weights the importance of each word for each document, based on how often the word occurs in that document and on the inverse of the number of documents in the corpus that contain it.


In [36]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer().fit_transform(df2)
df_tfidf = pd.DataFrame(data = tfidf.toarray(), index = df2.index, columns = df2.columns)
print(df_tfidf)


           ball       bat       bit       dog       hit       man       the
sent1  0.000000  0.000000  0.446101  0.446101  0.000000  0.446101  0.634809
sent2  0.446101  0.446101  0.000000  0.000000  0.446101  0.000000  0.634809

[2 rows x 7 columns]
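
To see roughly where these numbers come from, here is a numpy sketch of what TfidfTransformer computes under its default settings (smoothed idf and L2 row normalization). This is only an illustration of the formula, not the library's actual implementation:


In [ ]:
counts = df2.values                      # raw term frequencies, shape (n_docs, n_terms)
n_docs = counts.shape[0]
df_t = (counts > 0).sum(axis = 0)        # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df_t)) + 1   # smoothed inverse document frequency
tfidf_manual = counts * idf              # weight each term frequency by its idf
tfidf_manual /= np.linalg.norm(tfidf_manual, axis = 1, keepdims = True)  # L2-normalize each row
print(pd.DataFrame(tfidf_manual, index = df2.index, columns = df2.columns))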

You can also measure the distance between two documents with the pairwise_distances function in sklearn.


In [37]:
from sklearn.metrics.pairwise import pairwise_distances
euclid = pairwise_distances(df2) #Euclidean distance between the two documents.
df_euclid = pd.DataFrame(data = euclid, index = df2.index, columns = df2.index)
print(df_euclid)


         sent1    sent2
sent1  0.00000  2.44949
sent2  2.44949  0.00000

[2 rows x 2 columns]
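
pairwise_distances also takes a metric argument, so the same call can produce other distance measures. A small sketch using cosine distance as one example:


In [ ]:
cosine = pairwise_distances(df2, metric = 'cosine')  # cosine distance instead of Euclidean
print(pd.DataFrame(cosine, index = df2.index, columns = df2.index))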

Quiz:
Now it's your turn. There are many texts in the Data sub-directory of this directory. Write a function that takes a text file's path as input, reads the text from the file, splits it into its individual words, and returns a Series with the word types (i.e., unique words) as the index and the number of times they occur as the values.


In [41]:
def split_txt(filename):
    # write your code here
    with open(filename) as f:
        my_file = f.read()
    d, s = td_Series(my_file)
    return s

emma_Series = split_txt('./Data/austen-emma.txt')
print(emma_Series[:20])


"'Tis              1
"--Mrs.            1
"A                13
"A.                1
"About             1
"Agreed,           1
"Ah!              27
"Ah!"              3
"Ah!--(shaking     1
"Ah!--Indeed       1
"Ah!--so           1
"Ah!--well--to     1
"Ah,               2
"Almost            1
"And              45
"And,              3
"Another           1
"Are               4
"As                8
"At                1
dtype: int64

Take a look at the first 20 members of the Series. It looks like we have a couple of problems: capitalization and punctuation. Edit your function below to solve these problems.
Hint: use the punctuation constant in the string module to recognize punctuation.


In [42]:
from string import punctuation
print(punctuation)


!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

In [45]:
from string import punctuation
import re
def split_txt(filename):
    # write your code here
    with open(filename, encoding = 'utf-8') as f:
        my_file = f.read()
    #lowercase here
    my_file = my_file.lower()
    #strip punctuation here
    for c in punctuation:
        my_file = my_file.replace(c, ' ')
    d, s = td_Series(my_file)
    return s

emma_Series = split_txt('./Data/austen-emma.txt')
'''
The following code checks whether you have successfully cleaned your corpus.
Please do not change it.
'''
problems = []
for word in emma_Series.index:
    if re.search('[\WA-Z]', word):
        problems.append(word)
print(len(problems))


0

In [46]:
print(emma_Series[:20]) # take another look at the first 20 entries of the cleaned Series


000             2
10              2
1816            1
23rd            1
24th            1
26th            1
28th            2
7th             1
8th             1
a            3130
abbey          31
abbots          1
abdy            1
abhor           1
abhorred        1
abide           1
abilities       3
able           72
abode           1
abolition       1
dtype: int64

If the length of the problems list is not 0, then you are not yet finished. Take a look at your results to check what you did wrong and edit your code to correct the problem.

You now have a function that can take a text, clean it, and produce a term-document array (Series). Now, you should integrate this function into a script that will read and clean all the texts in the ./Data folder. You should then integrate all of the resulting Series into one large term-document matrix. Transform this matrix into a tf-idf matrix, and then run at least 5 of the metrics under pairwise_distances in sklearn.


In [47]:
from os import listdir

texts = listdir('./Data')
texts


Out[47]:
['austen-emma.txt',
 'austen-pride.txt',
 'austen-sense.txt',
 'blake-poems.txt',
 'blake-songs.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-piazza.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'whitman-leaves.txt',
 'whitman-patriotic.txt',
 'whitman-poems.txt']

In [48]:
#Write your code here or in a separate .py file. __Make sure I know where to find your file!__
from os import listdir

texts = listdir('./Data')
s_list = []
for f in texts:
    s_list.append(split_txt('/'.join(['./Data', f])))
td_df = pd.DataFrame(s_list, index = texts).fillna(0)
td_df


Out[48]:
(truncated preview of td_df: the rows are the 18 text files and the columns are the 32519 word types; the first columns happen to be numeric tokens such as '0', '00', '000', '00021053', ...)

18 rows × 32519 columns


In [51]:
len(td_df.iloc[0])


Out[51]:
32519
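
One possible sketch of the remaining steps, reusing the objects built above: tf-idf weight the full term-document matrix, then compare the texts under several of the metrics that pairwise_distances accepts. (The particular metrics chosen here are only an example, not the required answer.)


In [ ]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import pairwise_distances

tfidf_all = TfidfTransformer().fit_transform(td_df).toarray()  # dense tf-idf matrix
for metric in ['euclidean', 'cosine', 'cityblock', 'chebyshev', 'correlation']:
    dist = pairwise_distances(tfidf_all, metric = metric)
    print(metric)
    print(pd.DataFrame(dist, index = td_df.index, columns = td_df.index))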

Consider your results from each of these different metrics. Is there anything that suggests which of these metrics is better suited to analyzing this data?

Write your answer in this text box, below this line.

Your answer:

blahblahblah